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1. Scope of the lectures 

These lectures will address two questions. Is there a simple variational principle under- 
lying the existence of secondary motifs in the native state of proteins? Is there a general 
approach which can qualitatively capture the salient features of the folding process and 
which may be useful for interpreting and guiding experiments? Here, we present three 
different approaches to the first question, which demonstrate the key role played by the 
topology of the native state of proteins. The second question pertaining to the folding 
dynamics of proteins remains a challenging problem - a detailed description capturing 
the interactions between amino acids among each other and with the solvent is a daunting 
task. We address this issue building on the lessons learned in tackling the first question 
and apply the resulting method to the folding of various proteins including HIV protease 
and membrane proteins. The results that will be presented open a fascinating perspec- 
tive: the two questions appear to be intimately related. The variety of results reported 
here all provide evidence in favour of the special criteria adopted by nature in the selec- 
tion of viable protein folds, ranging from optimal compactness to maximum dynamical 
and geometrical accessibility of the native states. 
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dini, Gianni Settanni and Antonio Trovato who have contributed to the results discussed 
in these lectures. 

2. — Introduction 

A fascinating and open question challenging biochemistry, physics and even geometry is 
the presence of highly regular motifs such as a-heliccs and /3-sheets in the folded state of 
biopolymers and proteins. Stimulating explanations ranging from chemical propensity 
to simple geometrical reasoning have been invoked to rationalize the existence of such 
secondary structures. 

The realization that proteins have secondary structures arose with early crystallographic 
studies and the brilliant deduction of Pauling et al. [1] of the ability of an a-helix of 
the correct pitch to accomodate hydrogen bonds, thus promoting its stability. Inspired 
by the findings of Pauling, hclix-coil transition models have been used to study the 
thermodynamics of helix formation [2]. It is interesting to note, however, that the number 
of hydrogen bonds is nearly the same when a sequence is in an unfolded structure in the 
presence of a polar solvent or in its native state rich in secondary structure content [3]. 
It has also been suggested that the a-helix is an energetically favorable conformation 
for main-chain atoms but the side-chain suffers from a loss of entropy [3, 4]. Nelson et 
al. [5] have shown both numerically and experimentally that non-biological oligomers 
fold reversibly like proteins into a specific three-dimensional structure with high helical 
content driven only by solvophobic interactions. 

Recent studies have attempted to explain the emergence of secondary structure in pro- 
teins from geometrical principles rather than invoking detailed chemistry. Despite the 
concerted efforts of several groups, a simple general explanation remains elusive. A very 
natural line of investigation was undertaken by Yee et al. [6] , Hunt et al. [3] , and Socci 
et al. [7] who focused on the spontaneous emergence of secondary content from the mere 
requirement of overall compactness of homopolymeric chains. Their findings ruled out 
any significant relationship between the two, a fact also corroborated by the recent study 
of the kinetics of homopolymer collapse, where no evidence was found for the formation 
of local regular structures [8]. Despite the failure, this approach is particularly inter- 
esting due to the fact that optimal packing is a fundamental and fascinating problem 
in contexts ranging from every day like to atomic physics. Perhaps, the best known 
packing problem is the one introduced by Kepler nearly fours centuries ago, concerning 
the optimal packing of sphere. However the packing of independent objects, like spheres, 
must be treated differently from the case of objects connected in a chain, such as beads 
in a string (an idealization of peptide chains). In Sec. 3, the packing problem is general- 
ized to such chains and, remarkably, if one requires optimal packing uniformly along the 
chain, then a particular type of helix becomes the solution and, furthermore, it has the 
same geometrical characteristics as a-helices found in proteins. 

In addition to packing considerations, dynamical effects also play a significant role when 

rapid packing/impacking is entailed, as in the formation of amorphous glasses where 
crystallization is dynamically thwarted or in the more familiar problem of packing clothes 
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in one's suitcase. The same question may be asked for protein-like structure. The fact 
that they contain motifs which are optimally compact, does not imply that they can be 
easily reached from unfolded states. It is, however, widely believed that native states 
are, in general, highly accessible from the kinetic point of view. To investigate this issue 
we formulate, in Sec. 4, a dynamical variational principle for selection in conformation 
space based on the requirement that the backbone of the native state of biologically 
viable polymers be rapidly accessible from the denatured state. The variational principle 
is shown to result in the emergence of helical order in compact structures, revealing a 
surprising accord with the compactness requirement discussed above. 

Still concerning the folding dynamics, there are two key aspects distinguishing a protein 
from a generic heteropolymer: the specially selected sequence of amino acids and the 
three-dimensional structure that it folds reversibly into. Nature uses a rich repertory of 
twenty kinds of amino acids with sometimes major and at other times subtle differences in 
their interactions with the solvent and with each other in order to design sequences that 
fit the putative native state with minimal frustration [9] . The chosen sequences are such 
that their target native states are reached through a funnel-like landscape [10, 11, 12, 13] 
which facilitates the harmonious fitting together of pieces to form the whole. The three- 
dimensional structure impacts on the functionality of the protein and a fascinating issue 
is the elucidation of the selection mechanism in conformation space that picks out certain 
viable structures from the innumerable ones with a given compactness. Earlier studies 
have shown that there is a direct link between viable native conformations and high 
designability [14, 15]. 

A fruitful and general strategy for the study of protein folding would be to extract 
information on the folding process directly from the topology of the native state. This 
problem will be elucidated in Sec. 5 and applied to HIV protease and in Sec. 6 to 
membrane proteins. It will be shown that the natural folds of proteins have a much 
larger density of nearby structures than generic (artificial) conformations of the same 
character and that the exceedingly large geometrical accessibility of natural proteins 
may be related to the presence of secondary motifs [16]. It will be shown that a study 
of the influence of native state topology on the folding process can reveal information 
about the sites that are crucial to the folding process itself. As an application, we shall 
identiiy such sites for three proteins: 2ci2, barnase and HIV-1 Protease and show that 
they correlate very well with the key folding sites identified in experiments. 

In Sec. 6, a general model based on topological properties of the native state is introduce 
to decipher the folding of membrane proteins. Nearly a quarter of genomic sequences 
and almost half of all receptors that are likely to be targets for drug design [17] are 
integral membrane proteins. Understanding the detailed mechanisms of their foldinging 
mechanism is a largely unsolved, key problem in structural biology. By using our geo- 
metrical approach we can investigate the equilibrium properties and the folding kinetics 
of a two helix bundle fragment (comprising 66 amino-acids) of Bacteriorhodopsin. Once 
again, the approach seems to be extremely powerful and it appears to provide an efficient 
framework for understanding the variety of folding pathways of transmembrane proteins. 
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3. — Optimal shape of a compact polymeric chain 

A fundamental problem in every day life is that of packing with examples ranging from 
fruits in a grocery, clothes and personal belongings in a suitcase, atoms and colloidal 
particles in crystals and glasses, and amino acids in the folded state of proteins [18, 
19, 20, 21, 22]. The simplest problem in packing consists of determining the spatial 
arrangement that accomodates the highest packing density of its constituent entities 
with the result being a crystalline structure. 

A classic problem is the determination of the optimal arrangement of spheres in three 

dimensions in order to achieve the highest packing fraction. This problem first posed 
by Kepler has attracted much interest culminating in its recent rigorous mathematical 
solution [18, 19] that the answer for infinite systems is a face-centred-cubic lattice. This 
simply stated problem has had a profound impact in many areas [20, 21, 22], ranging 
from the crystallization and melting of atomic systems, to optimal packing of objects 
and subdivision of space. 

The close-packed hard sphere problem is simply stated: given N hard spheres of radius 
R, what is the arrangement which can be enclosed in the minimum volume, e.g. a cube 
of side L. This is solved by reformulating the problem, more convenient for numerical 
implementation, as the determination of the arrangement of a set of N points in a cube 
of linear size, Lq, that results in the minimum of half the distance between any pair of 
points or between the points and walls of the container, denoted by rmin, being as large 
as possible [23]. The linear size associated with the region enclosing the hard spheres 
follows from dimensional analysis and it is given hy L = RLo/rmin, from which it follows 
that maximizing r„„;„ is equivalent to minimizing L. It is notable that the resulting 'bulk' 
optimal arrangement in the large N limit exhibits translational invariance in that, far 
from the boundaries, the local environment is the same for all points. In dimension d = 2 
and d = 3 this corresponds to triangular and face-centred-cubic lattices respectively. 
Biopolymers like proteins, DNA and RNA have three dimensional structures which are 
rather compact. Furthermore, they are the result of evolution and one may think that 
their shape may satisfy some optimality criterion. This naturally leads one to consider 
a generalization of the packing problem of hard spheres to the case of flexible tubes 
with a uniform cross section. The packing problem then consists in finding the tube 
configuration which can be enclosed in the minimum volume without violating any steric 
contraints. 

The problem can alternatively be formulated in a very simple and elegant way in terms 
of the curve which is the axis of the tube (the analog of the sphere centers in the hard 
sphere packing problem) [24]. Consider a string (an open curve) in three dimensions. We 
will utilize a geometric measure [25] of the curve, the 'rope-length', defined as the arc 
length measured in units of the thickness, which has proved to be valuable in applications 
of knot theory [25, 26, 27, 28, 29, 30]. The thickness A denotes the maximum radius of a 
uniform tube with the string passing through its axis, beyond which the tube either ceases 
to be smooth, owing to tight local bends, or it self-intersects. Our focus is on finding 
the optimal shape of a curve of fixed arc length, subject to constraints of compactness. 
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which would maximize its thickness, or equivalently minimize its rope length. 
Following the approach of Gonzalez and Maddocks [28] , who studied knotted strings, we 
define a global radius of curvature as follows. The global radius of curvature of the curve 
at a given point is computed as the minimum radius of the circles going through that 
point and all other pairs of points of the string. It generalizes the concept of the local 
radius of curvature (the radius of the circle which locally best approximates the curve) 
by taking into account both local (bending of the string) and non-local (proximity to 
another part of the string) effects. For discrctizcd curves, the local radius of curvature 
at a point is simply the radius of the circle going through the point and its two adjoining 
points. The minimum of all the global radii then defines the thickness, i.e. the minimum 
radius of the circles going through any triplet of discrete points. This coincides with 
the previous definition in the continuum limit, obtained on increasing the number of 
discretized points (assumed to be equally spaced) on the curve keeping the string length 
fixed [28]. Given a string configuration, the thickness is just the maximum allowed radius 
for the cross section of a uniform tube that has the given curve as its axis [28] . 
We used several different boundary conditions to enforce the confinement of the string. 
The simplest ones discussed here are the confinement of a curve of length I within a cube 
of side L or constraining it to have a radius of gyration (which is the root-mean-square 
distance of the discretized points from their centre of mass) that is less than a pre-assigned 
value R. Even though different boundary conditions influence the optimal string shape, 
the overall features are found to be robust. Examples of optimal shapes, obtained from 
numerical simulations, for different ratios of l/L and l/R are shown in Fig. 1. In both 
cases, two distinct families of curves, helices and saddles, appear. The two families are 
close competitors for optimality and different boundary conditions may stabilize one over 
the other. For example, if optimal strings of fixed length are constrained to have a radius 
of gyration less than R, then upon decreasing R, the curve goes from a regime where the 
trivial linear string is curled into an arc, then into a portion of helix and finally into a 
saddle. When the string is constrained to lie within a cube of size L, as L decreases first 
saddles are observed and then helices. 

We have also been able to find bulk-like solutions which are not infiuenced by boundary 
effects. Such solutions can be obtained by imposing uniform local constraints along the 
curve. On imposing a minimum local density on successive segments of the string (for 
example, constraining each set of six consecutive beads to have a radius of gyration that 
is less than a preassigned value R), we obtained perfectly helical strings, corresponding 
to discretised approximations to the continuous helix represented in Fig. 2 confirming 
that this is the optimal arrangement. Note that, in close analogy with the sphere-packing 
problem, a helix has translational invariance along the chain. 

In all cases, the geometry of the chosen helix is such that there is an equality of the 
local radius of curvature (determined by the local bending of the curve) and the radius 
associated with a suitable triplet of non- consecutive points lying in two successive turns 
of the helix. This is a feature that is observed only for a special ratio c* of the pitch, p, 

and the radius, r, of the circle projected by the helix on a plane perpendicular to its axis. 
When p/r > c* = 2.512 the local radius of curvature, given by p = r(l -|- p^/(27rr)^). 
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(d) (e) (f) 

Fig. 1. - Examples of optimal strings. The strings in the figure were obtained starting from 
a random conformation of a chain made up of N equally spaced points (the spacing between 
neighboring points is defined to be 1 unit) and successively distorting the chain with pivot, 
crankshaft and slithering moves commonly used in stochastic chain dynamics [96] . A Metropolis 
Monte Carlo procedure is employed with a thermal weight, e"*"^^^ , where A is the thickness 
and T is a fictitious temperature set initially to a high value such that the acceptance rate is 
close to 1 and then decreased gradually to zero in several thousand steps. Self-avoidance of the 
optimal string is a natural consequence of the maximization of the thickness. The introduction 
of a hard-core repulsion between beads was found to significantly speed up convergence to the 
optimal solution and avoid trapping in self-intersecting structures. We have verified that the 
same values (within 1 percent) of the final thickness of the optimal strings are obtained starting 
from unrelated initial conformations. Top row: optimal shapes obtained by constraining strings 
of 30 points with a radius of gyration less than R. (a) R = 6.0, A = 6.42 (b) R = 4.5, A = 3.82 
(c) R = 3.0, A = 1.93. Bottom row: optimal shapes obtained by confining a string of 30 points 
within a cube of side L. (d) L = 22.0, A = 6.11 (e) L = 9.5, A = 2.3 (f) L = 8.1, A = 1.75. 



is lower than the half of the distance of closest approach of points on successive turns 
of the helix. The latter is given by the first minimum of 1/2-^2 — 2cos(27rt) + pH'^ for 
t > 0. Thus A = p in this case. 

lip/r < c*, the global radius of curvature is strictly lower than the local radius, and 
the helix thic;knc'ss is determined basically by the distance between two consecutive helix 

turns: A ~ p/2 lip/r <^ 1. Optimal packing selects the very special helices corresponding 
to the transition between the two regimes described above. A visual example is provided 



Fig. 2. - Shape of the optimal helix. The ratio of the pitch to radius of the centerline is 2.512. 



by the optimal helix of Fig. 2. 

For discrete curves, the critical ratio p/r depends on the discretization level. A more 
robust quantity is the ratio / of the minimum radius of the circles going through a given 
point and any two non-adjacent points and the local radius. For discretized strings, f ~ 1 
just at the transition described above, whereas / > 1 in the 'local' regime and / < 1 in 
the 'non-local' regime. In our computer-generated optimal strings, / differed from unity 
by less than a part in a thousand. 

It is interesting to note that, in nature, there are many instances of the appearance of 
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helices. For example, many biopolymers such, as proteins and enzymes, have backbones 
which frequently form helical motifs. (Rose and Seltzer [31] have used the local radii of 
curvature of the backbone as input in an algorithm for finding the peptide chain turns 
in a globular protein.) It has been shown [16] that the emergence of such motifs in pro- 
teins (unlike in random heteropolymers which, in the melt, have structures conforming 
to Gaussian statistics) is the result of the evolutionary pressure exerted by nature in the 
selection of native state structures that arc able to house sequences of amino acids which 
fold reproducibly and rapidly [32] and are characterized by a high degree of thermody- 
namic stability [33] . Furthermore, because of the interaction of the amino acids with the 
solvent, globular proteins attain compact shapes in their folded states. 
It is then natural to measure the shape of these helices and assess if they are optimal 
in the sense described here. The measure of / in a-helices found in naturally-occurring 
proteins yields an average value for / of 1.03 ± 0.01, hinting that, despite the complex 
atomic chemistry associated with the hydrogen bond and the covalent bonds along the 
backbone, helices in proteins satisfy optimal packing constraints. An example is provided 
in Fig. 3 where we report the value of / for a particularly long a-helix encountered in a 
heavily-investigated membrane protein, bacteriorhodopsin. 

This result implies that the backbone sites in protein helices have an associated free 
volume distributed more uniformly than in any other conformation with the same den- 
sity. This is consistent with the observation [16] that secondary structures in natural 
proteins have a much larger configurational entropy than other compact conformations. 
This uniformity in the free volume distribution seems to be an essential fcatiirc because 
the requirement of a maximum packing of backbone sites by itself does not lead to sec- 
ondary structure formation [6, 7]. Furthermore, the same result also holds for the helices 
appearing in the collagen native state structure, which have a rather different geometry 
(in terms of local turn angles, residues per turn and pitch [34]) from average a-helices. 
In spite of these differences, we again obtained an average / = 1.01 ± 0.03 (Fig. 4), very 
close to the optimal situation. 



4. — Fast folding polymers and role of secondary motifs 

Fast packing has been recognized as a central issue for biopolymers, such as proteins, 
since the early work of Levinthal [35]. The packing of strings as defined above and the 
slope of the optimal string we have found should also be kinetically very accessible. We 
postulate a direct connection between the dynamics of rapid folding and the emergence 
of secondary motifs in the native state conformations [32]. In fact, an intuitive approach 
to rapid and reproducible folding might be to create neat patterns of lower dimensional 
manifolds than the physical space and bend and curl them into the final folded state. 
For proteins, secondary structures such as a-helices and /3-sheets are indeed patterns in 
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Fig. 3. - Top. Local and non-local radii of curvature for sites in the first helix of bacteri- 
orhodopsin (pdb code lc3w). Bottom. Plot of / values for the same sites. 



low dimensions. 

The selection mechanism in structure space is formulated as a variational principle pos- 
tulating that, among all possible native conformations, a protein backbone will attain 
only those which are optimal under the action of evolutionary pressure favouring rapid 
folding. Our goal is to elucidate the role played by the bare native backbone independent 
of the selection in sequence space and hence of the (imperfectly-known) inter-amino-acid 
potentials. We therefore choose to employ a Go-like model [36] with no other interaction 
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Fig. 4. - Packing of collagen helices. Plot of / values as a function of sequence position for 
a single collagen helix (only Ca coordinates were used to identify the protein backbone). The 
same plot for each of the three collagen chains would simply superimpose. We considered the 
residues 14-42 from the structure laq5 in the Protein Data Bank. 



that promotes or disfavours secondary structures. The model is a sequence-independent 
hmiting case of minimal frustration[9] which, for a given target native state conformation, 
favours the formation of native contacts - the energy of a sequence in a conformation is 
simply obtained as the negative of the number of contacts in common with the target 

conformation. Wc will work in the Ca representation and consider two non-consccutive 
amino acids to be in contact if their separation is below a cutoff ro = 6.5 A(the results are 



11 



qualitatively similar when slightly different values of rg in the range 6 — 8 Aare chosen) . 
The energy of structure F in the Go model is given by 

(1) ^f(r) = -^EA,,,(r)A,,,(ro) 

where the sum is taken over all pairs of amino acids, Fq is the target structure, Aij{r) 
is the contact map of structure F: 

otherwise, 

where Rij is the distance of amino acids i and j . 

The polypeptide chain is modelled as a chain of beads subject to steric constraints [37, 16]. 
We adopted a discrete representation similar to the one of Covell and Jernigan [37, 38], 
in which c;acli bead occupies a site of an FCC lattice with lattice spacing equal to 3.8 A. 
Such a representation is able to describe the backbone of natural proteins to better that 
1 A rmsd per residue (equal to the best experimental resolution) and preserves typical 
torsional angles. All discrctizcd structures were subject to a suitable constraint: any two 
non-consecutive residues cannot be closer than 4.65 A due to excluded volume effects and 
the distance between consecutive residues can fluctuate between 2.6 A< d < 4.7 A. Such 
constraints were determined by an analysis of the coarse-grainings of several proteins of 
intermediate length 100 residues). In order to enforce a realistic global compactness 
for a backbone of length L, the number of contacts in all the target structures considered 
was chosen [39] to be around N = 1.9L while, locally, no residue was allowed to make 
contact with four or more consecutive residues. 

In order to assess the validity of the variational principle, it is necessary to evaluate the 
typical time, t{TQ), taken to fold into a given target structure, Fq, followed by a selection 
of the structures Fq, that have the smallest folding times. To do this, an initial set of 
ten conformations was generated by collapsing a loose chain starting from random initial 
conditions. In each case, we modified the random initial conformation by using Monte 
Carlo dynamics: we move up to 3 consecutive beads to unoccupied discrete positions 
that do not violate any of the physical constraints and accept the moves according to 
the standard Metropolis rule. The energy is given by eq. (1), while the temperature for 
the MC dynamics was set to 0.35. This value was chosen in preliminary runs so that it 
was higher than the temperature [9] below which the sequence is trapped in metastable 
states but comparable to the folding transition temperature so that conformations with 
significant overlap with the native state are sampled in thermal equilibrium. 
For each structure, as a measure of the folding time we took the median over various 
attempts (typically 41) of the total number of Monte Carlo moves necessary to form a pre- 
assigncd fraction of native contacts, typically 66%, starting from a random conformation. 
Our results were unaltered on increasing this fraction to 75%; indeed, this fraction could 
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be progressively increased towards 100% with successive generations without increase in 
the computational cost since better and better folders are obtained. 
A new generation of ten structures is created by "hybridizing" pairs of structures of the 
previous generation ensuring that structures with small folding times arc hybridized more 
and more frequently as the number of generations, (7, increases [40]. To do this, each of 
the two distinct parent structures to be paired, Fi and r2 are chosen with probability 
proportional to exp[— (g — 1) * /t)/1000], where g is the index of the current generation 
(initially equal to 1), ft is the median folding time. Then, a hybrid map is created by 
taking the union of the two parent maps: 

(3) A,^"-" = ma^(A,,.(ri),A,,(r2)). 

Because it is not guaranteed that A^"*°" corresponds to a three-dimensional structure 
obeying the same physical constraints as Fi and F2, the corresponding hybrid F is con- 
structed by taking one of the two parent structures (or alternatively a random one) as the 
starting conformation and carrying out MC dynamics favouring the formation of each of 
the contacts in the union map (i.e. using eq. (1) with Ajj(Fo) substituted by A^"*°"). 
The dynamics is carried out starting from a temperature of 0.7 and then decreasing it 
gradually over a sufficiently long time (typically thousands of MC steps) to achieve the 
maximum possible overlap with the union map, while simultaneously maintaining the 
realistic compactness. The resulting structure is typically midway between the two par- 
ent structures, in that it inherits native contacts from both of them. We adopted the 
following definition in order to obtain an objective and unbiased way to quantitatively 
estimate the presence of secondary content: a given residue, i was defined to belong to a 
secondary motif if, for some j, one of these conditions held: 



a) Aj_ij_i = Ajj = Aj+ij+i = Ajj+i 

= Aj+ij+2 = Aj_ij = 1; 

b) Ai+ij_i = Ajj = Ai_ij+i = Ajj+i 

= Aj+ij = Aj_ij+2 = 1- 

The former [latter] identifies the presence of helices and parallel [anti-parallel] /3 sheets 
in natural proteins, which can be identified by the visual inspection of contact matrices 
and appears as thick bands parallel or orthogonal to the diagonal. 
The upper plot of Fig. 5 shows the decrease of the typical folding time over the genera- 
tions for chains of length 25, while the middle panel shows the accompanying increase in 
the number of residues in secondary motifs (secondary content). The bottom panel shows 
a milder decrease of the contact order (i.e. a larger number of short-range contacts) as 
the generations evolved, in agreement with the experimental findings of Plaxco et aZ.[41] 
One of the optimal structures of length 25 is shown in Figure 6a. Due to the absence of 
any chirality bias in our structure space exploration, the helix does not have a constant 
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generation 




generation 

Fig. 5. - Evolution of the median folding time (measured in Monte Carlo steps), secondary 
structure content and contact order as a function of the number of generations in the optimiza- 
tion algorithm for compact structures of length L — 25. The dashed curve denotes an average 
over all ten structures in a given generation, whereas the solid curve shows the behaviour of 
the structure at each generation with the fastest median folding time. Analogous results are 
obtained for other runs and for other values of L. The dramatic decrease of folding time is 
accompanied by an equally significant increase in the secondary content. 



handedness. The signature of the secondary motifs in the optimal structures is clearly 

visible in the contact maps of Figure 7, which arc not sensitive to structure chirality. 
Strikingly, the variational principle selects conformations with significant secondary con- 
tent as those facilitating the fastest folding. It is also noteworthy that the average value 
of / defined in Sec. 3 is 0.9, which is very close to the ideal value, f — I, despite the fact 
that the underlying FCC lattice prevents the structures from attaining a regular helical 
shape. 

The correlation of the emergence of secondary structures with decrease of folding times 

is shown in the plot of Fig. 8. We verified that the hybridization procc;dure is not 
biased towards low contact order by iterating it for various generations and pairing the 
structures at random. Even after dozens of generations, the generated structures had 
secondary contents of about 1/3-1/4 of the true extremal structures. 
The very high secondary content in optimal conformations was found to be robust against 
changes in chain length or compactness of the target structure. On requiring that the 
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(a) (b) 

Fig. 6. - a) RASMOL plot of a structure with very low median folding time and L = 25. 
b) Structure with very low median folding time, L = 25 and higher compactness (all target 
conformations were constrained to have a radius of gyration smaller than 6.5 A). Optimal 
compact structures correspond to helices packed together, as observed in naturally occurring 
proteins. 



structure be more compact, bundles of helices emerge [see Fig. 6b] along with an increase 
in contact order, signalling the presence of some longer range contacts, which are neces- 
sitated in order to accomodate the shorter radius of gyration. It is noteworthy that our 
calculations lead predominantly to a-helices and not /3 sheets, a fact accounted for by the 
demonstration that steric overlaps and the associated loss of entropy lead to the destabi- 
lization of helices in favor of sheets [4] , the appearance of such sheets only in sufficiently 
long proteins [42] and the much slower folding rate of /?-sheets compared to a-helices [43]. 
It is remarkable that the same requirement of rapid folding is sufficient to lead to a selec- 
tion in both sequence and structure space underscoring the harmony in the evolutionary 




(a) (b) 

Fig. 7. - The panel on the loft [right] shows the contact map of a structure with a very low 
[average] median folding time. The signature of helices in map (a) is shown by the thick bands 
parallel to the diagonal, while no such patterns are observed in the matrix (b). 
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secondary content 

Fig. 8. - Scatter plot of folding time versus secondary content for structures of length 25 collected 
over several generation of the optimization algorithm. 



design of proteins. The results and strategies presented here ought to be applicable in 
protein-engineering contexts, for example by ensuring optimal dynamical accessibility of 
the backbone of proteins. A systematic collection of the rapidly-accessible structures of 
various length should also lead to the creation of unbiased libraries of protein folds. 



5. — Density of overlapping conformations for protein structures and role of 
native state topology in the folding process 

The rapid and reversible folding of proteins-like heteropolymers into their thermody- 
namically stable native state [44] is accompanied by a huge reduction in conformational 
entropy [45, 46]. Evidence has been accumulating for an achievement of the entropy 
reduction through a folding funnel favoring the kinetic accessibility of the native state 
[47, 9, 11, 48, 49]. In Sec. 3 and 4, we have seen that secondary motifs of proteins may 
arise from the requirements of both being compact and easily kinctically accessible. Here 
we focus on a further characterization of the special role played by the native structure 
of proteins; again we will make no use of detailed information regarding amino acid se- 
quences [16]. The study is carried out through a theoretical probe of the conformation 
space of proteins: a measure of the density of overlapping conformations (DOC) having 
a given overlap or percentage of contacts in common with a fixed native structure. We 
show with studies on chymotrypsin inhibitor (reference 2ci2 of the Protein Data Bank) 
and barnase (la2p) that the DOC provides key information on the folding pathway. An 
analysis of the DOC for real protein structures and for artificially generated decoy ones 
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suggests that an extremal principle is operational in nature, which maximizes the DOC 
at intermediate overlap, providing a large basin of attraction [13, 50, 9, 11, 48] for the 
native state and promoting the emergence of secondary structures. 

Our study consists of determining the number of structures with a given structural sim- 
ilarity to a putative native state. The structural similarity between the native structure 
and another one is defined as the percentage of native contacts in the alternative con- 
formation. It is well known that such a measure is a good coordinate characterizing the 
folding process [36, 51, 33]. To this purpose we adopt the Go scheme introduced in the 
previous section (sec eqn. 1). We also make again use of the FCC coarse-graining to 
avoid considering as distinct conformations that differ slightly. 

The generation of conformations was carried out using a standard Monte Carlo procedure 
(see e.g. Refs. [33, 53]) which allows one to move simultaneously up to 7 randomly chosen 
Ca's to unoccupied FCC sites. 

In order to minimize the effects of correlation between successively generated structures, 
we typically discarded 50 elementary moves before accepting each new conformation. A 
newly generated conformation was accepted with the usual Metropolis rule according to 
the change in the Boltzmann weight: e^/^^"^, where A is the change in contact overlap 
and T is a fictitious temperature. By choosing T appropriately, one can readily generate 
conformations with a desired average contact overlap, q. At a given temperature, the 
true number of structures with overlap q is proportional to the number of conformations 
with overlap q obtained in the simulation multiplied by the Boltzmann weight. On 
undoing the Boltzmann bias, it is possible to recover the true density of conformations 
in a region around q. In order to obtain the density of conformations for all values of 
overlap, we performed 2500 Monte Carlo samplings at different decreasing temperatures 
and then used standard deconvolution procedures [54]. Overall, for each distinct value 
of the overlap, more than 1000 structures were sampled. We have confirmed that the 
DOC curves are independent of the starting conformation and that the "folding" DOC 
obtained starting from a random conformation and cooling agrees to better than 3 % 
with the "unfolding" DOC obtained starting from the target structure and increasing 
the temperature. 

We begin with the backbones of the chymotrypsin inhibitor and barnase. We generated 
2500 structures with a not too large overlap [55] (« 40%) for each of them. It turned out 
that the most frequent contacts shared by the native conformation of 2ci2 with the others 
involved the helical-residues 30-42 (see top Fig. 9). Contacts involving such residues 
were shared by 56% of the sampled structures. On the other hand, the rarest contacts 
pertained to interaction between the helix and /3-strands and between the /3-strands 
themselves. A different behaviour (see bottom Fig. 10) was found for barnase, where, 
again, for overlap of « 40%, we find many contacts pertaining to the nearly complete 
formation of helix 1 (residues 8-18), a partial formation of helix 2, and bonds between 
residues 26-29 and 29-32 as well as several non-local contacts bridging the /3-strands, 
especially residues 51-55 and 72-75. 

Both this picture, and the one described for CI2 are fully consistent with the experimental 
results obtained by Fersht and co-workers in mutagenesis experiments [57, 58]. In such 
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Fig. 9. - Ribbon plot (obtained with RASMOL) of 2ci2 (top) and barnase (bottom). The 
residues involved in the 12 [16] most frequent contacts of alternative structures with overlap 
« 40% with the native conformations arc highlighted in black. The majority of these coincide 
with contacts that are formed at the early stages of folding. 



experiments, the key role of an amino acid at a given site is probed by mutating it and 
measuring the changes in the folding and equilibrium characteristics. By measuring the 
change of the folding/unfolding equilibriiim constant one can introduce a parameter, 
termed (/>-value, which is zero if the mutation is irrelevant to the folding kinetics and 
1, if the change in folding propensity mirrors the change in the relative stability of the 
folded and unfolded states (intermediate values are, of course, possible). Ideally, the 
measure of the sensitivity to a given site should be measured as a suitable susceptibility 
to a small perturbation of the same site (or its environment). Unfortunately, this is 
not easily accomplished experimentally, since substitution by mutation can be rarely 
regarded as a perturbation. Notwithstanding this difficulty, from the analysis of the 
0-values obtained by Fersht, a clear picture for the folding stages of CI2 and barnase 
emerges. In both cases, the crucial regions for both proteins are the same as those 
identified through the analysis of the DOC reported above. This provides a sound a 
posteriori justification that the main features of the folding of a protein can be followed 
from a study of the DOC. Remarkably, the method discussed above relies entirely on 
structure-related properties and suggests that the main features of the folding funnel are 
determined by the geometry of the "bare" backbone, while the finer details, of course, 
depend on the specific well-designed sequence. Since our own work [16], other groups have 
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Fig. 10. - Distribution of sequence separation of contacts commonly found in the conformations 
that overlap with the native state structures of 2ci2 and la2p. The most frequent contacts for 
2ci2 have a small sequence separation (3-4) and pertain to helix formation. The la2p case shows 
a very different behaviour with several contacts with very large sequence separation. 



used similar or alternative techniques to elucidate the role of the native state topology 
in the folding process [59, 43, 60, 61], confirming the picture outlined here. 
Let us consider one way in which proteins, in general, are special and diflFerent from 
arbitrary compact polymers. To do so, we turn to an analysis of three proteins of length 
51 (Ihcg, Ihja and Isgp) which have nearly the same number of native contacts (~ 83). 
For each structure, we calculated the DOC with the constraint that the total number 
of contacts in the alternative structures do not exceed the number of contacts in the 
native state by more than 10 % to avoid excessive compactness. To assess whether the 
DOC associated with naturally occurring proteins have special features, we generated 
three decoy conformations of the same length and number of contacts, but with different 
degrees of short and long range contacts (in sequence separation). These decoys (subject 
to the aforementioned "physical constraints" ) were generated with a simulated annealing 
procedure to find the structure with the highest overlap with a target contact matrix. By 
tuning the number of short-range versus long-range entries in the target random contact 
matrix, we generated three structures with different degrees of compactness and local 
geometrical regularity. 

The logarithmic plots of the DOC are shown in Fig. 11. A striking feature of the curves 

is that, for intermediate overlap, the DOC of the real proteins is enormously larger than 
that of the decoys and suggests that naturally occuring conformations have a much larger 
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Fig. 11. - Density of overlapping conformations for proteins for Isgp (filled squares), Ihja (filled 
pentagons) and Ihcg (filled hexagons). Curves for artificial decoy structures are denoted by the 
open symbols. 



number of entryway structures than random compact conformations. Furthermore, for 
very high values of the overlap, the steepness of the protein curves is much larger than 
those of the decoys, showing that the reduction in the conformational entropy is also 
higher. This implies the existence of a funnel with a very large basin and steep walls. 
Another significant feature is the good collapse of the protein curves. We verified that 
this feature also obtains for IbdO and 2pk4 which each have 80 residues and 140 and 
146 contacts respectively. A simple explanation for the curve collapse, could be that the 
DOC of real proteins is "extremal" , in that it is close to the maximum possible value for 
intermediate values of the overlap. 

The importance of the locality of contacts for folding kinetics was highlighted recently by 
Plaxco et al. [41] who found a correlation between folding rate and contact order, defined 
as the average sequence separation of contacts normalized to the total number of contacts 
and sequence length. With reference to Fig. 11, the contact order values for proteins 
Ihcg, Ihja and Isgp are 0.139, 0.214 and 0.204 respectively. For the decoy structures, 
they arc 0.424, 0.222 and 0.179 for the curves denoted by open squares, pentagons and 
hexagons, respectively. The structure with an unusually high contact order has the lowest 
DOC curve and optimal sequences designed on it (or equivalently a Go-like model) would 
be expected to exhibit slow folding dynamics [62] in accord with the findings of Plaxco 
et al. [41]. 

Secondary structure motifs [34, 15] have characteristic signatures in the contact maps, 
such as bands parallel to the diagonal (a-helices and parallel /J-sheets) or orthogonal 

to it (antiparallel /3-sheets), as it has been shown in the previous section. We have 
carried out some simple investigations to assess whether a correlation exists between the 
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Fig. 12. - The upper [lower] triangle shows a target contact matrix with L = 60 that has a large 
[intermediate] number of contact maps with an overlap of qmax — 2 contacts. 



extremality of the DOC curve and the emergence of sccondary-structure-likc motifs. Wc 
considered a space of contact maps [52], within which each of the residues interacted 
with the same number of other residues, ric (typically Tic = 5, as in the average case of a 
protein with about 100 residues and a cutoff distance of 6.5 A). This space contains maps 
corresponding to both real structures and unphysical ones. Furthermore, to mimic the 
effects of the rigidity and geometry of the peptide bond, we disallowed contacts between 
residue i and the four neighboring residues along the sequence i±2, i±l. 
In this context, the maximization of the density of states corresponds to finding the target 
matrix with the highest number of matrices sharing a given fraction of its contacts. 
Although it is difficult to solve this problem, for arbitrary values of the overlap, it is 
relatively easy to generate matrices with an overlap close to the maximum value, Qmax 
(for a LxL matrix, Qmax = L ■ ric). To enumerate all matrices with overlap Qmax — 2, one 
first identifies a pair of non-zero entries in the target matrix m: = fhki = 1- Then it 
is necessary to check whether entries fhu, fhkj are both "free" (i.e. equal to zero) and do 
not correspond to forbidden contacts (e.g. between i and i + 1). If this is so, the old pair 
of entries (and their symmetric counterpart) are set to zero, and the new ones to 1. By 
considering, in turn, all possible pairs of non-zero entries one can generate all matrices 
of overlap Qmax ^ 2 . Then, by performing a simulated annealing in contact-map space 
one can isolate the map having the highest number of matrices with overlap Qmax — 2. 
We carried out our calculations for values of L around 60. The optimal matrices exhibit 
clustering reminiscent of a-helices and /3-sheets, as shown in the upper triangle of Fig. 
12. A more quantitative measurement of the secondary-structure content of the optimal 
matrices can be obtained by considering the correlation functions, g±{x) = X^jTOj^jix) 
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Fig. 13. - Correlation functions (sec text) for an optimal target matrix of length 60 and for a 
protein of length 62 taken from the protein data bank. 



which show peaks in correspondence with the sequence separation of residues involved in 
a-hehces and parallel /3-sheets {g+) or antiparallel /3-sheets {g-)- A typical plot of the 
correlation functions for an optimal map of length 60 and for the protein 3ebx (length 
62) arc shown in Fig. 13. The similarity of the plots is striking, particularly because, in 
both cases, the height of the peaks in decreases with sequence separation, unlike the 
situation with g^. 

In summary, the geometry of protein backbones seems to have been optimized to provide 
a large basin of attraction to the native state. The results presented here are suggestive 
of an extremality principle underlying the selection of naturally occurring folds of pro- 
teins which, in turn, is shown to be possibly associated with the emergence of secondary 
structures. This observation complements the one made in the previous section, which 
highlighted how the presence of secondary motifs boosts the folding kinetics. Strikingly, 
by examining the DOC associated with a given native structure, it is possible to extract 
a wealth of information about the sites involved in crucial stages of the folding process. 
Despite the fact, that such sites are determined from the analysis of their crucial topolog- 
ical role with respect to the native state with no input of the actual protein composition, 
they correlate very well with the key sites determined experimentally. A striking example 
is provided in the following subsection, which focuses on an enzyme encoded by the HIV 
virus. It will be shown that from the mere knowledge of the contact map of the enzyme, 
one can isolate a handful of hot sites which correlate extremely well the key mutating 
sites determined in clinical trials of anti-AIDS drugs. 
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51. Application to HIV-1 protease: drug resistance and folding Pathways. - A close ex- 
amination of the curves for the density of states (DOC) can reveal the presence of phase 
transitions in the system (or more accurately, their analog for finite-size systems). In 
particular, a change of the concavity/convexity of the DOC curves signals a first-order 
like transition. This is illustrated in the examples of Figs. 14 and 15 where the curve for 
the DOC of protein Ihja is shown and accompanied with a graph of the energy versus 
temperature and the specific heat (obtained through histogram rcwcighting techniques). 
Indeed, the peak in the specific heat identifies the folding transition temperature. Nat- 
urally, one would like to identify the pairs of sites that contribute most to the Cy peak. 
For a given pair, (i, j), this amounts to measuring, as a function of the temperature, T, 
the quantity 

(4) a(i,i) = ^{E{i,3)ETot) - {E{i,j)){ETot) 

where E{i,j) is the contribution of the pair to the total interaction energy, Exot- The 
most crucial residues will then be identified with those with the highest values of C'i,{i,j), 
which are expectedly observed near the folding transition temperature. We have used 
this strategy to determine succesfully the key sites of HIV-1 [66]. Our simulations, at 
variance with that described earlier, are now carried out in the continuum. As usual, 
amino acids are represented as effective centroids placed on Ca atoms, while the peptide 
bond between two consecutive amino acids, z, i -|- 1 at distance rj^j+i is described by the 
anharmonic potential adopted by Clementi et al. [63], with parameters a = 20, b = 2000. 
The interaction among non-consecutive residues is treated again in Go-like schemes [36] 
which reward the formation of native contacts with a decrease of the energy scoring 
function. Each pair of non-consecutive amino acids, i and j, contributes to the energy 
scoring function by an amount: 



(5) 

where ro = 6.8A, r^- denotes the distance of amino acids i and j in the native structure 
and £ is the native contact matrix whose entries are 1 (0) if i and j are (not) in contact 
in the native conformation. The contact is defined as in the previous section, i.e. two 
aminoacids are in contact if their distance is less than 6.5 A. Vq and Vi are constants 
controlling the strength of interactions (Vq = 20, Vi = 0.05 in our simulations) 
Constant temperature molecular dynamics simulations were carried out where the equa- 
tions of motion are integrated by a velocity- Verlet algorithm combined with the standard 
Gaussian isokinetic scheme [64, 66] . Unfolding processes can be studied within the same 
framework by warming up starting from the native conformation (heat denaturation) . 
The free-energy, the total specific-hcat, Cy, and contributions of the individual contacts 
to Cy were obtained combining data sampled at different equilibrium temperatures with 
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Fig. 14. Plot of the logarithm of the density of states (density of alternative conformations) 
versus energy for protein Ihja. The latter is defined as the negative of the number of native 
contacts present in a given conformation. 



multiple histogram techniques [54]. The thermodynamics quantities obtained through 
such deconvolution procedures did not depend, within the numerical accuracy, on whether 
unfolding or refolding paths were followed. 

The contacts that contribute more to the specific heat peak are identified as the key 
ones belonging to the folding bottleneck and sites sharing them as the most likely to be 
sensitive to mutations. Furthermore, by following several individual folding trajectories 

(by suddenly quenching unfolded conformations below the folding transition temperature, 
Tfoid) we ascertained that all such dynamical pathways encountered the same kinetic 
bottlenecks determined as above. 

For the f3 sheets, the bottlenecks involve amino acids that arc typically 3-4 residues away 
from the turns - specifically, residues 61, 62, 72, 74 for (33, 10, 11, 12, 21, 22, 23 for 
/3i and 44, 45, 46, 55, 56, 57 for (32- At the folding transition temperature, Tfoid, the 
formation of contacts around residues 30 and 86 is observed. The largest contribution to 
the specific heat peak is observed from contacts 29-86 and 32-76 which are, consequently, 
identified as the most crucial for the folding/unfolding process. 

Such sites are physically located at the active site of HIV-1 PR, which is targeted by anti 
AIDS drugs [65]. Hence, within the limitations of our simplified approach, we predict 
that changes in the detailed chemistry at the active site also ruin a key step of the 
folding process. To counteract the drug action, the virus has to perform some very 
delicate mutations at the the key sites; within a random mutation scheme this requires 
many trials (occurring over several months). The time required to synthesize a mutated 
protein with native-like activity is even longer if the drug attack correlates with several 
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Fig. 15. - Plots ol the energy (top) and specific heat (bottom) as a lunction of temperature for 
protein Ihja. The curves were obtained through histogram reweighting techniques. 



bottlenecks simultaneously. 

This is certainly the case for several anti-AIDS drugs. Indeed Table I summarizes the 
mutations for the FDA approved drugs [67]. In Table II, we list the sites sharing the 
most important contacts involved in the four bottlenecks TB, (3i, P2 and P^. Remarkably, 
among the first 23 most crucial sites predicted by our method, there are 7 sites in common 
with the 16 distinct mutating sites of Table I. The probability that two sets of 16 and 23 
sites randomly taken from a total population of 99 (the length of the HIV-1 PR monomer) 
share at least 7 sites is only 3 %. Also note that, all the mutation sites of Table I except 
82, 35, 36 and 90 fall within a mismatch of at most one position from the sites of Table 
II. These results highlight the high statistical correlation between our prediction and the 
evidence accumulated from clinical trials. 

In conclusion, the strategy presented here, which is entirely based on the knowledge of 
the native structure of HIV-1 protease, allows one both to identify the bottlenecks of 
the folding process and to explain their highly significant match with known mutating 
residues [66]. This and similar approaches should be readily applicable to identify the 
kinetic bottlenecks of other viral enzymes of pharmaceutical interest. This could allow a 
fast development of novel inhibitors targetting the kinetic bottlenecks. This is expected 
to dramatically enhance the difficulty for the virus to express mutated proteins which 
still fold efficiently into the same native state with unaltered functionality. 
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Name 



Point Mutations 



Bottlenecks 



RTN [68, 69] 
NLF [70] 
IND [71, 72] 
SQV [71, 72, 73] 
APR [74] 



20, 33, 35, 36, 46, 54, 63, 71, 82, 84, 90 
30, 46, 63, 71, 77, 84, 
10, 32, 46, 63,71, 82, 84 
10, 46, 48, 63, 71, 82, 84, 90 
46, 63, 82, 84 



TB, /32, /33 

TB, /32, 03 
TB, (3i,l32, 03 

TB, 01,02, 03 

TB, /32, Ps 



Table I. - Mutations in the protease associated with FDA-approved drug resistance [67]. Sites 
highlighted in boldface are those involved in the folding bottlenecks as predicted by our approach. 
Pi refers to the bottleneck associated with the formation of the i-th 0-sheet, whereas TB refers 
to the bottleneck occurring at the folding transition temperature Tfou (see next Table). 



6. Application of geometrical models to investigate the folding of membrane 
proteins 

While the behavior of small water soluble globular proteins is reasonably well understood [75, 
76], much less is known about proteins (membrane proteins: MP) [77, 78, 79, 80, 81, 82, 
83, 84, 85, 86, 87, 88] that cross biological membranes and that control solute trans- 
port, signal transmission and energy conversion between the two sides of the membrane. 
This lack of knowledge is related to the difficulty in experimental handling. Membranes 
consist of phospholipid bilayers with a hydrophobic interior: the surface of MP that in- 
teract with such an apolar environment is also hydrophobic and this property causes MP 
to aggregate in aqueous solution, unless detergents are used. This circumstance makes 
crystallization of MP's difhuclt and native structures have been determined only for an 
handful of them. 

Transmembrane proteins (TMP) are the most important and best studied class of MP 
[77, 78, 89]. They are characterized by the presence of long segments (20 — 30) of amino 
acids with a high degree of hydrophobicity. In the native structure, these correspond to 
the transmembrane segments which are inserted in the lipidic interior of the membrane 
[90] . These segments are predominantly made up of a-helices and /J-sheets. The stability 



Bottleneck 


Key sites 


TB 


22, 29, 32, 76, 84, 86 


01 


10,11,13,20,21,23 




44,45,46,55,56,57 


P3 


61,62,63,72,74 



Table II. - Key sites for the four bottlenecks. For each bottleneck, only the sites in the top three 
pairs of contacts have been reported. 
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of a-helices and /3-sheets inside the membrane follow from the formation of hydrogen 
bonds between the backbone atoms - other possibilities are excluded within the apolar 
enviroment[77, 81]. 

A detailed study of TMP has not yet been possible because little is known about the 
amino-acids interactions among themselves, with the solvent and in particular with the 
lipidic interior of the membrane. Here [91], we present a simple strategy to decipher the 
folding kinetics of transmembrane proteins which is directly inspired by the geometrical 
approaches we have previously applied to globular proteins. This approach by-passes the 
details of the complex interactions of the protein in the lipid enviroment by introducing 
effective potentials, induced by the presence of the membrane and the associated interface 
region, that stabilize the native state structure. 

Due to the small number of degrees of freedom involved in our scheme, the dynamics of 
the system can be simulated for the full folding process. Moreover, the free energies of 
the most relevant intermediate states and free energy profiles along the reaction paths 
connecting them can be explicitly calculated by thermodynamic integration. Thus the 
model is able to quantitatively discriminate between the possible reaction paths envisaged 
for the insertion process of TMP across the membrane[77]. 

The TMP we considered is made up with the first 66 amino acids (each one repre- 
sented by a fictitious residue located at the position of the atom) of the first two 
a-helices of bacteriorhodopsin (Fig. 16a). It has been shown that the first two helices 
of bacteriorhodopsin can be considered as independent folding domains [92] and that the 
side-by-side interactions between transmembrane helices play a key role in the stabiliza- 
tion of the protein structure [93]. The membrane is described simply by a slab of width 
w = Zjnaii — Zmin = 26 A. Two uon-bondcd residues are considered to form a contact 
if their distance is less then 6.5 A. In the study of globular proteins, the topology of 
the native state is encoded in the contact map by considering all the pairs of non- 
consecutive residues that are in contact. Here, in addition, the locations of such pairs 
with respect to the membrane has to be taken into account. The contacts are divided 
into three classes: 

• membrane contacts where both i and j residues are inside the membrane; 

• interface contacts with i and j in the interface region [77] outside the membrane; 

• surface contacts with one residue inside the membrane and the other outside. 

Thus a given protein conformation can have a native contact but improperly placed 
with respect to the membrane {misplaced native contact). The crucial interaction po- 
tential between non-bonded residues is taken to be a modified Lennard- Jones 12-10 
potential: 
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a) b) 

Fig. 16. - Structure and thermodynamics of the helical transmembrane protein. a) Ribbon rep- 
resentation of the two-helix fragment of bacteriorhodopsin formed by the first 66 amino-acids. 
The part inside the membrane is shown in red, the part above (below) the membrane in blue 
(green), b) Average equilibrium fraction of native contacts outside, g;, (o), inside, (square), 
and across, qa (A), the membrane as a function of the temperature T. All these quantities are 
expressed in energy unit of e. 



rij and are the distance between the residues (j,j) and their distance in the native 
configuration, respectively. The matrices r(i,j) and Ti{i,j) encode the topology of 
the TMP in the following way: if (i,j) is not a contact in the native state r(z, j) — 
0, Ti(i^j) = 1; if (i, j) is a contact in the native state but not at the proper location (i.e. 
a misplaced contact) r(i,j) = ei.Tiii.j) — 0; if («,j) is a native state contact in the 
proper region Tii^j) — e, Ti[i,j) = 0. This model is intended to describe the folding 
process in the interface and in the membrane region. Our interaction potential (similar in 
spirit to the Go model[36] introduced before) assigns two values to the energy associated 
with the formation of a native contact, e and ei. The model captures the tendency 
to form native contacts. In addition, in order to account for the effective interactions 
between the membrane and the protein, the model assigns a lower energy, — e, to the 
contact which occurs in the same region as in the native state structure compared to — ei 
when the contact is formed but in the wrong enviroment. This mechanism proves to be 
crucial in driving the insertion of the protein across the membrane. 
When e — ei, the protein does not recognize the presence of the interface-membrane 
region and the full rotational symmetry is restored. The difference in the parameters 
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(e — ei) determine the amount of tertiary structure formation outside the membrane. 
Our results are independent of the precise values of the energy parameters e and ei 
(e > ei) as long as they are not too close to each other. 

Our simulations have been performed with ei =0.1 and e = 1. In order to account for 
the chirality of the TMP, a potential for the pseudodihedral angle ai between the Cq 
atoms in a helix corresponding to four successive locations is added which biases the 

helices to be in their native state structure. 

The thermodynamics and the kinetics of the model were studied by a Monte Carlo method 
for polymer chains carried out in the continuum. The efficiency of the program (usually 
low for continuum calculations) has been increased by full use of the link cell technique 
[94] and by the multiple Markov chain method, a new sampling scheme, which has been 
proven to be particulary efficient in exploring the low temperature phase diagram for 
polymers [95]. In our simulation 20 different temperatures (measured in dimensionless 
units) ranging from T = 2 to T = 0.17 have been studied. 

The free energy difference J^b — between two states A and B has been estimated as 
the reversible work that has to be done in order to go from A to B [91]. The free energy 
differences obtained with this method are accurate to within ~ 0.1 Tc for the various 
states whereas the free energy barriers are accurate within ^ 0.5 Tc . This error takes 
into account possible hysteresis effects due to the finite simulation time. 
The structural similarity between the system equilibrated at temperature T and the 
native state is shown in Figure 16b in terms of the average fraction of native state 
contacts as a function of T and partitioned depending on their positions with respect to 
the membrane. The three curves correspond respectively to the average fraction of native 
contacts inside (qm) , outside (qi,) and across ((/,) the membrane. All these curves, well 
separated at high T, collapse for T below the transition temperature Tq ^ 0.6, indicating 
a cooperative effect in the folding. On monitoring the free energy as a function of the 
energy around Tc, one observes additional local minima (besides those corresponding to 
the unfolded and folded states) suggesting the presence of an intermediate. 
The intermediate is characterized by having the two helices almost completely formed 
but not yet correctly inserted across the membrane. The presence of these extra minima 
suggests that non-constitutive membrane proteins would fold with multi-state kinetics 
corresponding to on-pathway intermediates. To establish the nature of the dominant 
folding pathway, we have performed a detailed analysis of the folding kinetics. Each 
independent kinetic folding simulation was started with the equilibrated denaturated 
state at T* = 2.5 . The protein is placed initially outside the membrane in the interface 
region [77], at a distance comparable to the average size of the denatured protein and 
then suddenly quenched to a temperature (T = 0.4) well below the transition tempera- 
ture. This case simulates the folding kinetics of non-constitutive membrane; proteins, i.e. 
proteins that do not need a translocon providing a 'tunnel' through which the protein 
is injected into the lipid bilayer. Folding to the native state occurs mainly through the 
states depicted in Figure 17a with the dominant pathways shown in Figure 17b. 
In all the pathways, the system goes from the unfolded state, U to state HI in which 
80% of the secondary structure is formed (see q in Figure 18c) and disposed horizontally 
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State HO 



State HI 



State HV CT 




Native State (N) 



State I 



a) 




b) 

Fig. 17. - Schematic representation of states encountered by our model protein during the folding 
process. In a) the red cylinders denote a-helices that reside within the membrane in the native 
state. The region inside the membrane is in turquoise whereas the rest represents the interface 
region [77]. State U denotes the denatured state of the protein, HO is a state in which the 
helices have been formed but are not yet inside the membrane whereas HI corresponds to a 
similar state but with the helices completely embedded in the membrane without any inter- 
helical contacts. HV denotes an obligatory intermediate and A'' depicts the native state. The 
state {/} represents an ensemble of long lived conformations in which helices are formed inside 
the membrane with several inter-helical contacts, but with the two a-helices still incorrectly 
positioned. In b) the schematic pathways to the native state are shown. 
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along the interface. The free energy of this state (measured with respect to the free 
energy of the fully folded state) is ~ 2.4 Tc- This state corresponds to the formation of 
around 70 % of the membrane contacts. The average time thi to reach state HI is of 
the order of 500 Monte Carlo steps (sec Figures 18; each Montc-Carlo step corresponds 
to 50000 attempted local deformations.). State HI turns out to be an obligatory on- 
pathway intermediate of the folding kinetics for non-constitutive MP in agreement with 
the general argument mentioned above. Once the protein reaches state HI, it undergoes a 
relatively slow process of self-arrangement in order to insert and assemble the secondary 
structures across the membrane. This process is the rate-limiting step of the folding 
process, since it involves the translocation, through the lipidic layer, of a substantial 
number of hydrophilic residues. Among the possible pathways, starting from HI, the 
most frequent (60% of the cases) and the fastest turn out to be C/ ^ HI HV N. 
A quantitative characterization of this dominant pathway is presented in Figures 18 
(for a single folding process). The intermediate HV is characterized by having one a 
helix inserted across the membrane and is reached in an average period corresponding 
to a significant fraction of the total folding time (see Figure 16). The free energy in 
this state is ~ 0.98 Tc- The free energy barrier between HI and HV is at ~ 4.31 Tc 
(hence, the rate constant of the transition HI —>■ HV is proportional to kni^HV = 
exp(- (4.31- 2.4) Tc/T)). 

The last part of the folding process corresponds to the insertion of the second helix 

and the assembly of the two secondary structures into the native state structure. This 
process lasts approximately one third of the folding time along the pathway U — > HI — » 
HV — » N. The quasistatic free energy barrier between HV and the folded state is ~ 
1.66 Tc- The rate constant of the transition HV N is, therefore, proportional to 
exp (— (1.66 — 0.98) Tc/T). These results are consistent with the time scales observed in 
the unconstrained folding dynamics. At the end, the protein is completely packed, {qm 
saturates to 1 (Figure 18a) and the helices are correctly positioned across the membrane 
(note the second jump in the z coordinate of the center of mass in Figure 18b). 
Much slower dynamics can occur when non-obligatory intermediates are visited by the 
system. These long lived states ({/} in Figure 17a) involve a distribution of misfolded 
regions that trap the system and are characterized by having most of the inter-helical 
contacts formed (assembly of the secondary structures) but with the two a-helices still 
incorrectly positioned. Note, for example, that in states {/} , only transmembrane con- 
tacts and some contacts outside the membrane are misplaced and they account for only 
a small fraction of the native state energy. For this reason, in the states {/}, the free 
energy is ~ 1.44 Tc, only slightly higher than the free energy of HV. The folding can pro- 
ceed from {/} either by disentangling the two helices and passing through the obligatory 
intermediate HV, or by the simultaneous translocation through the membrane of the 
two helices. These processes, however, entail the crossing of a big free energy barrier (~ 
5.18 Tc for the first process and 6.1 Tc for the second) and happen with low probability. 
Indeed, at suflaciently low temperatures, the loss in energy of the interhelical contacts 
is not compensated by the gain in the configTirational entropy due to the uncoupling of 
the a— helices. Thus below the folding temperature, I-states act as trapping regions for 
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Fig, 18, - Typical time dependence of different parameters as a function of the Monte-Carlo 
steps for the pathway U HI HV N. Fraction of native contacts inside the membrane 
(a), normalized z-coordinate of the center of mass of the protein (with respect to that of the 
native state conformation) (b) and overall fraction of native helical contacts (c), (d) Protein 
conformations at different times during the folding. The colours red, green and blue have the 
same significance as in Figure la with the grey bonds being ones crossing the membrane. 



the system and when trapped, the protein spends most of the time during folding in this 
state. 

In summary, we have shown that a topology based model can lead to a vivid picture of 
the folding process. Our approach predicts a folding process involving multiple pathways 
with a dominant folding channel. Further details not captured by the present approach 
may of course change the quantitative nature of the results. However, the model, which 
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captures the bare essentials of a membrane protein, ought to provide a zeroth order 
picture of the folding process. Also, as experimental data becomes available, the results 
could be benchmarked with models of this type to determine the nature of the other 
factors that matter. 
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